Abstract

It is well known that Sparse PCA (Sparse Principal Component Analysis) is NP-hard 
to solve exactly on worst-case instances. What is the complexity of solving Sparse PCA 
approximately? Our contributions include: 

1. a simple and efficient algorithm that achieves an $n^{-1/3}$-approximation;

2. NP-hardness of approximation to within $(1 - \epsilon)$, for some small constant $\epsilon > 0$;

3. SSE-hardness of approximation to within any constant factor; and

4. an $\exp\exp(\Omega(\sqrt{\log\log n}))$ (“quasi-quasi-polynomial”) gap for the standard semidefinite program.


1 Introduction 

Principal component analysis (PCA) is one of the most popular tools for data analytics. PCA 
operates on data point vectors supported on features, and outputs orthogonal directions (i.e., 
principal components) that maximize the explained variance. A limitation of PCA is that,
in many cases of interest, the extracted principal components (PCs) are dense. However,
in applications such as text analysis or gene expression analytics, having only a few nonzero
features per extracted PC offers significantly higher interpretability. For example, in text
analysis where PCs are supported on words, a PC that consists of only a few words
can be used to detect frequently occurring topics.

Sparse PCA addresses the issue of interpretability directly by enforcing a sparsity constraint
on the extracted PCs. Given a matrix of centered data samples $S \in \mathbb{R}^{n \times p}$, let us denote
by $A = \frac{1}{p} S S^\top$ the sample covariance matrix of the data set. The leading sparse principal
component is the solution to the following sparsity-constrained quadratic form maximization:

$$\max_{\|x\|_2 = 1,\ \|x\|_0 \le k} x^\top A x \qquad \text{(SparsePCA)}$$

where $\|x\|_2$ is the $\ell_2$-norm and $\|x\|_0$ denotes the number of nonzero entries in $x$.
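
To make the definition concrete, here is a minimal sketch of an exact (exponential-time) solver for SparsePCA on tiny instances. It relies on the fact that, for a fixed support, the best unit vector is the top eigenvector of the corresponding principal submatrix. The function name `sparse_pca_bruteforce`, the random data, and the normalization of the covariance by the number of samples are assumptions made for illustration only.

```python
import itertools
import numpy as np

def sparse_pca_bruteforce(A, k):
    """Exact SparsePCA value by enumerating all supports of size k (exponential time)."""
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for support in itertools.combinations(range(n), k):
        idx = np.array(support)
        # Restricting x to a support reduces the problem to an ordinary
        # top-eigenvector computation on the k x k principal submatrix.
        vals, vecs = np.linalg.eigh(A[np.ix_(idx, idx)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(n)
            best_x[idx] = vecs[:, -1]
    return best_val, best_x

# Hypothetical toy data: 6 features, 40 centered samples.
rng = np.random.default_rng(0)
S = rng.standard_normal((6, 40))
S -= S.mean(axis=1, keepdims=True)
A = S @ S.T / S.shape[1]          # sample covariance (6 x 6, PSD)
value, x = sparse_pca_bruteforce(A, k=2)
```

Enumerating supports of size exactly $k$ suffices here because $A$ is PSD, so enlarging a support never decreases the top eigenvalue of the submatrix.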

The objective in the above optimization is usually referred to as the explained variance. This 
metric has an operational meaning: if a linear combination of k features has high explained 
variance, then it captures a representative behavior of the samples. Typically, this means that 
these features “interact” significantly with each other. As an example consider the case where 
$A$ is a covariance matrix of a gene expression data set. Then, the $(i,j)$ entry of $A$ is a proxy
for the positive or negative interaction between the $i$th and $j$th gene. In this case, if a subset
of $k$ genes “explains” a lot of variance, then these genes have strong pairwise interactions.

There has been a large volume of work on sparse PCA: from heuristic algorithms, to statistical
guarantees, and conditional approximation ratios. Yet, there are remarkably few worst-case
approximability bounds, and many questions remain open. Does sparse PCA admit a nontrivial
worst-case approximation ratio? Are there significant computational barriers? How does it relate
to other problems? In this work we take a modest first step towards a better understanding
of these questions. Our contributions are summarized below.

1. We show that a simple spectral technique that is popular in practice, combined with a
column selection procedure, achieves an $n^{-1/3}$-approximation ratio for SparsePCA.

2. We establish that, assuming $P \ne NP$, SparsePCA does not admit a PTAS.

3. We further prove that, assuming the Small Set Expansion (SSE) Hypothesis [45], SparsePCA
is hard to approximate to within any constant factor.

4. We construct an $\exp\exp(\Omega(\sqrt{\log\log n}))$ (a “quasi-quasi-polynomial”) gap instance for the
following commonly used SDP relaxation of [21]:

$$\max\ \operatorname{tr}(AX) \quad \text{such that } \operatorname{tr}(X) = 1,\ \mathbf{1}^\top |X| \mathbf{1} \le k,\ X \succeq 0$$

1.1 Discussion of techniques and connections to other sparsity problems 

A recurring theme in our technical discussion is the comparison of SparsePCA to (variants of) the
Densest $k$-Subgraph (DkS) problem: given a graph $G$, find the $k$-vertex subgraph that contains
the highest number of edges. Notice that DkS can be stated as a quadratic form maximization,
similar to SparsePCA:

$$\max_{x \in \{0,1\}^n,\ \|x\|_0 \le k} x^\top A x \qquad \text{(DkS)}$$

The connection between the two problems has been observed before. For example, it has
been noted by many authors that the $k$-Clique problem, a decision variant of DkS (see footnote 1), reduces to
solving SparsePCA exactly, thus the latter is NP-hard. Then, the Planted Clique, an average-case
variant of DkS, was recently used to establish statistical recovery hardness results in the
sparse spiked-covariance model [12, 11, 52].

Then, why are algorithmic and inapproximability DkS results not directly applicable to 
SparsePCA? From a computational standpoint, the most important difference between the two 
problems is the restriction on the input matrix A: In SparsePCA, A is required to be positive 
semi-definite, whereas in DkS, A is required to be entry-wise nonnegative. 

With the above comparison to DkS in mind, we are now ready to discuss our results and 
techniques. 

$n^{-1/3}$-approximation algorithm Our $n^{-1/3}$-approximation scheme outputs the best solution among
the following three procedures: i) pick the best standard basis vector; ii) pick the largest $k$
entries in any column vector of $A$; and iii) pick the largest $k$ entries of the leading eigenvector of
$A$.

Our algorithm is inspired by (but is substantially different from) a combinatorial $\Omega(n^{-1/3})$-approximation
algorithm for DkS, due to Feige et al. [27]. The aforementioned ratio for DkS
was further improved in the same paper to $\Omega(n^{-1/3+\epsilon})$, and later to $\Omega(n^{-(1/4+\epsilon)})$ [13]. It is an
open question whether similar ideas can improve the approximation guarantees for SparsePCA.

Footnote 1: Notice that $k$-Clique is an exact variant of both Max-Clique and DkS. By now, the inapproximability of Max-Clique is relatively well understood (e.g. [28, 33, 56]), but these results do not translate to inapproximability of DkS (or SparsePCA).
NP-hardness Our NP-hardness of approximation reduction begins from MAX-E2SAT-d, the
problem of maximizing the number of satisfied clauses in a CNF formula, where every clause
contains exactly two distinct variables, and every variable appears in exactly $d = O(1)$ clauses.
We set $A_{i,j}$ to be higher if literals $i$ and $j$ satisfy some clause, and a consistent assignment is
ensured by having large negative values whenever indices $i$ and $j$ correspond to a literal and its
negation. A PSD matrix is obtained by adding a large multiple of the identity. As we discuss
below, this last step seems to be the main obstacle to obtaining a stronger inapproximability
factor.

Interestingly, this result highlights an important difference between SparsePCA and DkS: for 
the latter, proving NP-hardness to within any constant factor remains a major open problem. 

The PSD challenge The biggest challenge to obtaining inapproximability results for SparsePCA,
from say DkS, is achieving $A \succeq 0$. One naive approach is to add a large multiple of the identity
matrix and force diagonal dominance (as we do for our NP-hardness result). Unfortunately, this 
ruins our inapproximability factor: the large entries on the diagonal outweigh the interactions 
between different features. In particular, every vector achieves a reasonably high explained 
variance. 

A second approach to obtain a PSD matrix is by squaring the adjacency matrix. When we
start from Planted Clique, or other known hard DkS instances (e.g. [14, 1, 34, 16]), squaring the
adjacency matrix gives weak inapproximability results, as in the case of [12, 11, 35, 52] (see also
discussion of impossibility results for the sparse spiked covariance model below). To understand
why, it is helpful to consider random walks on regular graphs. The density of a subgraph is
proportional to the probability that a length-1 random walk remains in the subgraph. (Thus
the densest $k$-subgraph is also the least expanding $k$-subgraph.) Similarly, when we restrict
$A^2$ to the same $k$-tuple of vertices, the density corresponds to the probability of remaining in
the subgraph after a random walk of length 2. Intuitively, squaring the adjacency matrices of
the instances mentioned at the beginning of this paragraph is ineffective, because even their
dense subgraphs are very expanding: most length-2 walks that start and end inside the densest
$k$-subgraph take their middle step outside the subgraph. Thus the density of the subgraph has
only a small effect on the density with respect to $A^2$. To overcome this difficulty, we want the
“good” subgraph to have very small expansion.

SSE-hardness and SDP gap The Small Set Expansion Hypothesis [45] postulates that it is hard
to find a linear-size subgraph with a very small expansion. Intuitively, if the expansion of a
particular $k$-subgraph is sufficiently small, then, even after taking two steps, the random walk
should remain inside the subgraph; thus the corresponding $k$-sparse vector should continue
to give an exceptionally high value for DkS/SparsePCA with $A^2$. To formalize this intuition,
we apply a recent result of Raghavendra and Schramm [43] on the expansion of random walk
graphs.

Finally, the gap for the standard semidefinite program for SparsePCA builds on known
integrality gap instances for SSE, in particular the Short Code graph [10]. Notice that the
“quasi-quasi-polynomial” factor ($e^{e^{\Omega(\sqrt{\log\log n})}}$) is slightly smaller than polynomial and
“quasi-polynomial” ($e^{\Omega(\sqrt{\log n})}$) factors, but much larger than polylogarithmic.

Additive PTAS To complete the picture of our current understanding of worst-case approximability
of SparsePCA, let us also mention a recent additive PTAS due to [8]. By additive PTAS,
we mean that if all the entries of $A$ are bounded in $[-1, 1]$, the optimum explained variance can
be approximated in polynomial time to within an additive error of $\epsilon k$, for any constant $\epsilon > 0$
(compare to an optimum of at most $k$ in the case of an all-ones $k \times k$ submatrix). In contrast,
note that a corresponding additive PTAS for DkS is unlikely [16].
1.2 Related work


Heuristics and algorithms The algorithmic tapestry for sparse PCA is rich and diverse. Early 
heuristics used rotation and thresholding of eigenvectors [32, 29, 17] and LASSO heuristics 
[5, 30]. Then, in [55], a nonconvex $\ell_1$-penalized approximation regenerated a lot of interest
in the problem. A great variety of greedy, spectral, and nonconvex heuristics were presented 
in the past decade [51, 40, 41, 50, 31, 53, 36]. There has also been a significant body of work 
on semidefinite programming (SDP) approaches [21, 54, 22, 23]. Some recent works established 
conditional approximation guarantees for sparse PCA using spectral $\epsilon$-net search algorithms,
under the assumption of a decaying matrix spectrum [7, 8]. 

Sparse spiked covariance model and recent impossibility results The performance of many algorithms
has been analyzed under the sparse spiked covariance and related models. For example,
under this model Amini et al. [4] develop the first theoretical guarantees for simple thresholding
and the SDP of [21]. Several statistical analyses were carried out for more general settings,
using a variety of different algorithms, under various metrics of interest [38, 23, 18, 19, 25].

In this model, we are collecting samples from a distribution with a covariance matrix that is 
equal to the identity plus a sparse rank-1 matrix (the spike). Our goal is to identify (or detect) 
the rank-1 sparse “spike” from the samples. If we could observe the true covariance matrix the 
algorithmic task would be trivial. However, when the input to this problem is a finite number 
of samples, then there exist sharp information theoretic, and computational barriers on the 
identifiability of the spike. 

A recent celebrated line of works [12, 11, 52], initiated by Berthet and Rigollet, establishes a
gap between the sample threshold at which detection is information-theoretically possible and
that where it is computationally feasible, assuming hardness of the Planted Clique problem; a
similar result was also obtained by Krauthgamer et al. [35] with respect to the standard SDP.
However, these results do not show a significant gap between the optimal value of the primal
objective ($x^\top A x$) and what is achievable by efficient algorithms. In particular, none of these
results rules out a polynomial time algorithm (possibly one that does not even attempt
to detect the spike) that achieves a $(1 - o(1))$-approximation of the optimal explained variance.
This weak inapproximability seems to be a fundamental barrier of reductions from Planted
Clique (see also discussion in Section 1.1).

We should also mention some recent inapproximability results in the general case where A 
is not necessarily positive semi-definite (PSD) [39]. (Recall that in typical applications A is a 
covariance matrix and thus necessarily PSD.) We note that in this general matrix setting, it is 
even hopeless to determine, in polynomial time, the sign of the optimal value, unless P = NP. 

1.3 Organization 

Our approximation algorithm is described in Section 2. In Section 3 we prove our NP-hardness 
result, and our SSE-hardness result appears in Section 4. Finally, in Section 5 we prove the 
quasi-quasi-polynomial gap for the standard SDP. For completeness, we also briefly describe in 
Section 6 the additive PTAS due to [8], and briefly discuss in Section 7 the case where the input
matrix is not PSD.

2 $n^{-1/3}$-approximation algorithm

Theorem 2.1. SparsePCA can be approximated to within $n^{-1/3}$ in deterministic polynomial
time.

The rest of this section is devoted to the proof of Theorem 2.1. Our algorithm takes the
best of two options: a truncation of one of $A$’s columns in the standard basis, and a truncation
of one of $A$’s eigenvectors. We present and analyze the guarantees for each algorithm, and then
show that together they give the bound on the approximation ratio.

Let $y^*$ denote an optimum solution to the Sparse PCA instance, and let $\mathrm{OPT} = (y^*)^\top A y^*$
denote its value.


2.1 Truncation in the standard basis 

Algorithm 1 For each $i \in [n]$, let $A_{\cdot,i}$ be the $i$-th column of $A$, and let $x_i$ be the unit-norm,
$k$-sparse truncation of $A_{\cdot,i}$. That is, let

$$[\hat{x}_i]_j = \begin{cases} A_{j,i} & \text{if } |A_{j,i}| \text{ is one of the } k \text{ largest (in absolute value) entries of } A_{\cdot,i} \\ 0 & \text{otherwise} \end{cases}$$

and $x_i = \hat{x}_i / \|\hat{x}_i\|_2$.

Return the best out of all $x_i$’s and $e_i$’s, where $e_i$ is the $i$-th standard basis vector.


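Lemma 2.2. Algorithm 1 returns a solution with value $V_1(A, k) \ge \mathrm{OPT}/\sqrt{k}$.

Proof. First, we claim that for each $i$, $x_i$ maximizes $e_i^\top A x$ among all feasible ($k$-sparse and
unit-norm) vectors $x$. By the Cauchy-Schwarz inequality, for any choice of support $S$ of size $k$, the
unit-norm vector that maximizes the inner product with $A_{\cdot,i}$ is the restriction of $A_{\cdot,i}$ to $S$,
normalized. The inner product is thus $\sqrt{\sum_{j \in S} A_{j,i}^2}$; this is indeed maximized when $S$ is the set
of entries with largest absolute value.

Now, rewrite $y^* = \sum_i y_i e_i$ as a linear combination of (at most $k$) standard basis vectors. By
the Cauchy-Schwarz inequality, we have

$$\mathrm{OPT} = \sum_i y_i \left(e_i^\top A y^*\right) \le \sqrt{\sum_i y_i^2} \cdot \sqrt{\sum_{i \in \mathrm{supp}(y^*)} \left(e_i^\top A y^*\right)^2}.$$

Plugging in $\sum_i y_i^2 = \|y^*\|_2^2 = 1$, we get

$$\mathrm{OPT} \le \sqrt{\sum_{i \in \mathrm{supp}(y^*)} \left(e_i^\top A y^*\right)^2} \le \sqrt{k} \max_i \left|e_i^\top A y^*\right|.$$

In particular, since both $y^*$ and $-y^*$ are feasible, this means that for some $i$ we have $e_i^\top A x_i \ge \mathrm{OPT}/\sqrt{k}$, where $x_i$ is as defined above.
Finally, since $A \succeq 0$, we have

$$0 \le (e_i - x_i)^\top A (e_i - x_i) = e_i^\top A e_i + x_i^\top A x_i - 2 e_i^\top A x_i.$$

Rearranging, we get

$$\max\left\{e_i^\top A e_i,\ x_i^\top A x_i\right\} \ge e_i^\top A x_i \ge \mathrm{OPT}/\sqrt{k}.$$

□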


2.2 Truncation in the eigenspace basis 

Algorithm 2 Let $(v_1, \lambda_1)$ be the top eigenvector and eigenvalue of $A$. Return the unit-norm,
$k$-sparse truncation of $v_1$. That is, let

$$[\hat{x}]_i = \begin{cases} [v_1]_i & \text{if } [v_1]_i \text{ is one of the } k \text{ largest (in absolute value) entries of } v_1 \\ 0 & \text{otherwise} \end{cases}$$

and $x = \hat{x} / \|\hat{x}\|_2$. Return $x$.

Lemma 2.3. Algorithm 2 returns a solution with value $V_2(A, k) \ge \frac{k}{n} \mathrm{OPT}$.

Proof. First, notice that

$$x^\top A v_1 = \lambda_1 x^\top v_1 = \lambda_1 x^\top \hat{x} = \lambda_1 \cdot \|\hat{x}\|_2 \ge \lambda_1 \sqrt{k/n},$$

where the last inequality follows by the greedy construction of $\hat{x}$. Since $A \succeq 0$, it induces an
inner product over $\mathbb{R}^n$. Thus we can apply the Cauchy-Schwarz inequality to get:

$$\lambda_1 \sqrt{k/n} \le x^\top A v_1 \le \sqrt{x^\top A x} \cdot \sqrt{v_1^\top A v_1} = \sqrt{x^\top A x} \cdot \sqrt{\lambda_1}.$$

Rearranging, we have

$$x^\top A x \ge \frac{k}{n} \lambda_1.$$

Finally, to complete the proof recall that $\lambda_1 = v_1^\top A v_1 \ge \mathrm{OPT}$ since $v_1$ maximizes the
objective function among all (not necessarily $k$-sparse) unit-norm vectors. □


2.3 Putting it all together

Our final algorithm simply takes the best out of the outputs of Algorithms 1 and 2. We now
have

$$V(A,k) = \max\{V_1(A,k),\ V_2(A,k)\} \ge \left(V_1(A,k)\right)^{2/3} \cdot \left(V_2(A,k)\right)^{1/3} \ge \frac{\mathrm{OPT}^{2/3}}{k^{1/3}} \cdot \frac{k^{1/3}}{n^{1/3}}\, \mathrm{OPT}^{1/3} = \mathrm{OPT}/n^{1/3}.$$
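
The combined procedure of this section can be sketched in a few lines. This is an illustrative NumPy rendering under our own naming (`truncate`, `approx_sparse_pca`), not reference code from the paper; it evaluates all standard basis vectors, all truncated columns, and the truncated top eigenvector, and returns the candidate with the largest explained variance.

```python
import numpy as np

def truncate(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest, and normalize."""
    x = np.zeros_like(v, dtype=float)
    top = np.argsort(np.abs(v))[-k:]
    x[top] = v[top]
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def approx_sparse_pca(A, k):
    """Best of: standard basis vectors, truncated columns, truncated top eigenvector."""
    n = A.shape[0]
    candidates = [np.eye(n)[i] for i in range(n)]              # e_i
    candidates += [truncate(A[:, i], k) for i in range(n)]     # Algorithm 1
    _, vecs = np.linalg.eigh(A)
    candidates.append(truncate(vecs[:, -1], k))                # Algorithm 2
    return max(candidates, key=lambda x: x @ A @ x)

# Hypothetical PSD test instance.
rng = np.random.default_rng(1)
B = rng.standard_normal((8, 8))
A = B @ B.T
x = approx_sparse_pca(A, k=3)
value = x @ A @ x
```

On small instances the returned value can be compared against an exhaustive search over supports to observe the $\mathrm{OPT}/n^{1/3}$ guarantee of Theorem 2.1.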


3 NP-hardness 

Theorem 3.1. There exists a constant $\epsilon > 0$ such that SparsePCA is NP-hard to approximate
to within $(1 - \epsilon)$.

Proof. We reduce from MAX-E2SAT-d: given a 2CNF over n variables where every variable 
appears in exactly d distinct clauses, maximize the number of satisfied clauses. 

Lemma 3.2. There exist constants $0 < s < c < 1$ and $d$ such that given a MAX-E2SAT-d
instance over $n$ clauses, it is NP-hard to decide whether at least $cn$ clauses can be satisfied
(“yes” case), or at most $sn$ (“no” case).

Lemma 3.2 follows from standard techniques. We briefly sketch the proof below for completeness.

Proof sketch of Lemma 3.2. By, e.g. [26], MAX-3SAT-5 is NP-hard to approximate to within
some constant factor. We can convert each 3SAT clause $C = (x \vee y \vee z)$ into 10 2SAT clauses
(introducing one additional variable $h_C$),

$$(x) \wedge (y) \wedge (z) \wedge (h_C) \wedge (\neg x \vee \neg y) \wedge (\neg x \vee \neg z) \wedge (\neg y \vee \neg z) \wedge (x \vee \neg h_C) \wedge (y \vee \neg h_C) \wedge (z \vee \neg h_C)$$

with the following guarantee: the optimal assignment to the 2SAT instance satisfies at most 
7 out of 10 clauses for every satisfied 3SAT clause, and at most 6 out of 10 clauses for every 
unsatisfied 3SAT clause (e.g. [49]). 

This establishes the result for MAX-2SAT with bounded degree. Add a linear number of 
variables and trivially satisfied clauses to get a MAX-E2SAT-d instance. □ 
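
The 7-versus-6 guarantee of the ten-clause gadget in the proof sketch can be checked by exhaustive enumeration. The following sketch (with our own helper name `satisfied_2sat_clauses`) fixes each assignment to $x, y, z$ and takes the better of the two choices for the auxiliary variable.

```python
from itertools import product

def satisfied_2sat_clauses(x, y, z, h):
    """Number of satisfied clauses among the 10 clauses of the gadget."""
    clauses = [
        x, y, z, h,
        (not x) or (not y), (not x) or (not z), (not y) or (not z),
        x or (not h), y or (not h), z or (not h),
    ]
    return sum(bool(c) for c in clauses)

# For every assignment to (x, y, z), take the best choice of the auxiliary variable h.
best = {}
for x, y, z in product([False, True], repeat=3):
    best[(x, y, z)] = max(satisfied_2sat_clauses(x, y, z, h) for h in (False, True))
```

The dictionary `best` maps each assignment of the original clause to the number of gadget clauses that can be satisfied, which separates satisfying assignments of $(x \vee y \vee z)$ from the single falsifying one.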
Given a 2CNF $\varphi$, we construct a symmetric $2n \times 2n$ matrix $A^{(0)} = A^{(0)}(\varphi)$ as follows: every
row/column corresponds to a literal of $\varphi$; if row $i$ and column $j$ correspond to an assignment
that satisfies some clause, then $A^{(0)}_{i,j} = 1$, and $A^{(0)}_{i,j} = 0$ otherwise. Let $\mathcal{Y}$ denote the set of
vectors that correspond to legal assignments to $\varphi$, i.e.

$$\mathcal{Y} = \left\{ y : \|y\|_2 = 1;\ y \in \{0, 1/\sqrt{n}\}^{2n};\ \forall i\ \ y_{x_i} = 0 \iff y_{\neg x_i} = 1/\sqrt{n} \right\}.$$

By Lemma 3.2 it is NP-hard to distinguish between

$$\text{“yes”: } \max_{y \in \mathcal{Y}} y^\top A^{(0)} y \ge c \qquad\qquad \text{“no”: } \max_{y \in \mathcal{Y}} y^\top A^{(0)} y \le s.$$

The proof continues by adding the following matrices to $A^{(0)}$: a matrix $C$ with large negative
entries that enforces a consistent assignment; a larger scalar times the identity matrix that 
ensures our input is PSD; and an even larger (yet still constant) scalar times the all-ones matrix 
that guarantees the optimal solution uses a large support. While adding these matrices preserves 
the qualitative properties of the instance, they significantly weaken our inapproximability factor. 

Enforcing a consistent assignment 

Our first step is to enforce consistency using the objective function instead of restricting the
input to be from $\mathcal{Y}$. Let $C_{i,j} = -2d$ if $i$ and $j$ correspond to a literal and its negation, and
$C_{i,j} = 0$ otherwise. We claim that among all unit-norm vectors $z \in \{0, 1/\sqrt{n}\}^{2n}$, the objective
$z^\top (A^{(0)} + C) z$ is maximized by some legal assignment $z^* \in \mathcal{Y}$. Assume by contradiction that the
objective is maximized by some $z$ which assigns $1/\sqrt{n}$ to some variable $x_i$ and its negation; since
$z$ is exactly $n$-sparse, it must also assign 0 to another variable $x_j$ and its negation. However,
the objective value can be increased by considering $z'$ which assigns $1/\sqrt{n}$ to $x_i$ and $x_j$, 0 to
their negations, and is equal to $z$ everywhere else. Therefore, for $A^{(1)} = A^{(0)} + C$, we have

$$\text{“yes”: } \max_{\|z\|_2 = 1,\ z \in \{0, 1/\sqrt{n}\}^{2n}} z^\top A^{(1)} z \ge c \qquad\qquad \text{“no”: } \max_{\|z\|_2 = 1,\ z \in \{0, 1/\sqrt{n}\}^{2n}} z^\top A^{(1)} z \le s.$$


PSD input 

$A^{(1)}$ is not a legitimate input to SparsePCA because it is not positive semi-definite. Fortunately,
$A^{(2)} = 3dI + A^{(1)}$ is positive semi-definite because it is symmetric and diagonally
dominant. The term $3dI$ adds exactly $3d$ to the objective function for any unit-norm input. Therefore
we also have

$$\text{“yes”: } \max_{\|z\|_2 = 1,\ z \in \{0, 1/\sqrt{n}\}^{2n}} z^\top A^{(2)} z \ge 3d + c \qquad\qquad \text{“no”: } \max_{\|z\|_2 = 1,\ z \in \{0, 1/\sqrt{n}\}^{2n}} z^\top A^{(2)} z \le 3d + s \qquad (1)$$

Enforcing a (nearly) $n$-uniform optimum

Now, we would of course like to replace $\{0, 1/\sqrt{n}\}^{2n}$ with the set of all $n$-sparse vectors, while
maintaining (approximately) the same optima. Consider the positive semi-definite matrix $J = \mathbf{1}\mathbf{1}^\top$;
the objective $x^\top J x = \|x\|_1^2$ is maximized by an $n$-uniform vector in $\{0, 1/\sqrt{n}\}^{2n}$.

We define our final hard instance input matrix to be $A^{(3)} = \frac{\alpha}{n} J + A^{(2)}$, for a sufficiently
large (but constant) $\alpha$. As we show below, the objective is now maximized by a vector $x$ that
is approximately $n$-uniform.
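
The matrices in this construction are easy to build explicitly for a toy formula. The sketch below assumes one particular reading of the definition of $A^{(0)}$ (entry 1 when two literals occur together in a clause) and a literal indexing of our own choosing; both are assumptions for illustration, as is the example formula. It checks numerically that $A^{(1)}$ is not PSD while $3dI + A^{(1)}$ is.

```python
import numpy as np

def build_matrices(n_vars, clauses, d):
    """Literal 2*v is x_v and literal 2*v + 1 is its negation (an indexing chosen here)."""
    size = 2 * n_vars
    A0 = np.zeros((size, size))
    for i, j in clauses:
        A0[i, j] = A0[j, i] = 1.0            # literals i and j share a clause (assumed reading)
    C = np.zeros((size, size))
    for v in range(n_vars):
        C[2 * v, 2 * v + 1] = C[2 * v + 1, 2 * v] = -2.0 * d   # penalize x_v together with not x_v
    A1 = A0 + C
    A2 = 3.0 * d * np.eye(size) + A1         # diagonally dominant, hence PSD
    return A0, A1, A2

# Toy formula (x0 or x1) and (not x0 or x2) and (not x1 or not x2): each variable occurs in d = 2 clauses.
n_vars, d = 3, 2
clauses = [(0, 2), (1, 4), (3, 5)]
A0, A1, A2 = build_matrices(n_vars, clauses, d)

min_eig_A1 = np.linalg.eigvalsh(A1).min()
min_eig_A2 = np.linalg.eigvalsh(A2).min()

# A legal assignment (x0 = x1 = True, x2 = False) as a vector in {0, 1/sqrt(n)}^{2n}.
z = np.zeros(2 * n_vars)
z[[0, 2, 5]] = 1 / np.sqrt(n_vars)
shift = z @ A2 @ z - z @ A1 @ z              # contribution of the 3dI term
```

Each row of $A^{(1)}$ has off-diagonal absolute sum at most $3d$ under this reading, which is what makes the diagonal shift by $3d$ sufficient.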
Formally, observe that $A^{(2)}$ induces an inner product over $\mathbb{R}^{2n}$; thus for any $\|x\|_2 = \|z\|_2 = 1$
we can use the Cauchy-Schwarz inequality to get:

$$\begin{aligned} x^\top A^{(2)} x - z^\top A^{(2)} z &= (x - z)^\top A^{(2)} (x + z) \\ &\le \sqrt{(x - z)^\top A^{(2)} (x - z)} \cdot \sqrt{(x + z)^\top A^{(2)} (x + z)} \\ &\le \left\|A^{(2)}\right\|_2 \cdot \|x + z\|_2\, \|x - z\|_2, \end{aligned}$$

where $\|A^{(2)}\|_2$ is the $\ell_2$ operator norm of $A^{(2)}$, and is bounded by:

$$\left\|A^{(2)}\right\|_2 = \max_{\|x\|_2 = 1} x^\top A^{(2)} x = 3d + \max_{\|x\|_2 = 1} x^\top A^{(1)} x \le 5d + \max_{\|x\|_2 = 1} x^\top A^{(0)} x \le 6d.$$

By the triangle inequality, $\|x + z\|_2 \le 2$, and therefore

$$x^\top A^{(2)} x - z^\top A^{(2)} z \le 12d\, \|x - z\|_2. \qquad (2)$$


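Suppose further that $z$ is a rounding of $x$ to $\{0, 1/\sqrt{n}\}^{2n}$. In particular, $\mathrm{supp}(x) \subseteq \mathrm{supp}(z)$
(we have equality if $x$ is exactly $n$-sparse) and $x^\top z = \lambda_z \ge 0$. Let us decompose $x = \lambda_z z + \lambda_w w$
where $w$ is a unit-norm vector orthogonal to $z$ (i.e. $\|w\|_2 = 1$ and $w^\top z = 0$). Since all the vectors
have unit norm, $\lambda_z^2 + \lambda_w^2 = \lambda_z^2 \|z\|_2^2 + \lambda_w^2 \|w\|_2^2 = \|x\|_2^2 = 1$. We can now write the difference between
$x$ and $z$ as,

$$\begin{aligned} \|x - z\|_2^2 &= \left\| (\lambda_z - 1)\, z + \lambda_w\, w \right\|_2^2 \\ &= (1 - \lambda_z)^2 + (1 - \lambda_z^2) \\ &\le 2\,(1 - \lambda_z^2). \end{aligned} \qquad (3)$$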


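Since $\mathrm{supp}(x) \subseteq \mathrm{supp}(z)$, we also have that $\mathrm{supp}(w) \subseteq \mathrm{supp}(z)$. Thus $w^\top z = 0$ is equivalent
to $w^\top \mathbf{1} = 0$. We therefore have:

$$\begin{aligned} z^\top J z - x^\top J x &= z^\top J z - (\lambda_z z + \lambda_w w)^\top J (\lambda_z z + \lambda_w w) \\ &= z^\top J z - (\lambda_z z)^\top J (\lambda_z z) && (\text{since } w^\top \mathbf{1} = 0) \\ &= (1 - \lambda_z^2) \cdot z^\top J z \\ &\ge \tfrac{1}{2} \|x - z\|_2^2 \cdot \|z\|_1^2 = \tfrac{n}{2} \|x - z\|_2^2. && (\text{by Eq. (3)}) \end{aligned} \qquad (4)$$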


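Recall that $A^{(3)} = \frac{\alpha}{n} J + A^{(2)}$. Combining Eq. (2) and Eq. (4), we have that for every $n$-sparse,
unit-norm $x$,

$$\max_{\|z\|_2 = 1,\ z \in \{0, 1/\sqrt{n}\}^{2n}} z^\top A^{(3)} z - x^\top A^{(3)} x \ge \frac{\alpha}{2} \cdot \|x - z\|_2^2 - 12d \cdot \|x - z\|_2, \qquad (5)$$

where on the right-hand side $z$ is the rounding of $x$.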

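Let $\alpha = 144 d^2 / (c - s)$. Then,

$$\begin{aligned} \frac{\alpha}{2} \cdot \|x - z\|_2^2 - 12d \cdot \|x - z\|_2 &= \frac{72 d^2}{c - s} \cdot \|x - z\|_2^2 - 12d \cdot \|x - z\|_2 \\ &= \frac{2}{c - s} \left( 36 d^2 \cdot \|x - z\|_2^2 - 6d \cdot \|x - z\|_2\, (c - s) + \frac{(c - s)^2}{4} \right) - \frac{c - s}{2} \\ &= \frac{2}{c - s} \left( 6d \cdot \|x - z\|_2 - \frac{c - s}{2} \right)^2 - \frac{c - s}{2} \\ &\ge -\frac{c - s}{2}. \end{aligned}$$
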
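Plugging into Eq. (5), we have

$$\max_{\|z\|_2 = 1,\ z \in \{0, 1/\sqrt{n}\}^{2n}} z^\top A^{(3)} z \le \max_{\|x\|_2 = 1,\ \|x\|_0 \le n} x^\top A^{(3)} x \le \frac{c - s}{2} + \max_{\|z\|_2 = 1,\ z \in \{0, 1/\sqrt{n}\}^{2n}} z^\top A^{(3)} z.$$

Finally, since $z^\top \left(\frac{\alpha}{n} J\right) z = \alpha$ for every $n$-uniform $z$, by Eq. (1) it is NP-hard to distinguish between:

$$\text{“yes”: } \max_{\|x\|_2 = 1,\ \|x\|_0 \le n} x^\top A^{(3)} x \ge \alpha + 3d + c \qquad\qquad \text{“no”: } \max_{\|x\|_2 = 1,\ \|x\|_0 \le n} x^\top A^{(3)} x \le \alpha + 3d + \frac{c + s}{2}.$$

□
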

4 Small-Set Expansion hardness 

Throughout this section, we will consider edge-weighted 1-regular graphs G = (V,E), whose 
adjacency matrix/probability transition matrix G has every row sum equal to 1. 

Recall that for a 1-regular graph $G = (V,E)$ on $n$ vertices, the expansion of $S \subseteq V$ is

$$\Phi_G(S) = \frac{|E(S, V \setminus S)|}{|S|},$$

where $|E(S, T)| = \sum_{i \in S,\, j \in T} G_{ij}$ denotes the total weight of edges with one end point in $S$ and
one end point in $T$. The expansion profile of $G$ is

$$\Phi_G(\delta) = \min_{S : |S| \le \delta n} \Phi_G(S).$$

Recall the Small-Set Expansion Hypothesis [46] (see footnote 2):

Problem 4.1 (SSE($\eta, \delta$)). Given a regular graph $G = (V, E)$, distinguish between the following
two cases:

1. Yes: Some subset $S \subseteq V$ with $|S| = \delta n$ has $\Phi_G(S) \le \eta$

2. No: Any set $S \subseteq V$ with $|S| \le 2\delta n$ has $\Phi_G(S) \ge 1 - \eta$

Conjecture 4.2 (Small-Set Expansion Hypothesis [46]). For any $\eta > 0$, there is $\delta > 0$ such
that SSE($\eta, \delta$) is NP-hard.

There is little consensus among researchers whether this conjecture is true. At any rate, 
if the conjecture turns out to be false, significantly new algorithmic or analytic ideas will be 
needed. See e.g. [6, 9] on efforts to refute the conjecture and pointers to the literature.

It is more convenient to work with the following version of Small-Set Expansion, where in 
the No case the subset size can be an arbitrarily large constant multiple of the subset size in 
the Yes case. 

Problem 4.3 (SSE($\eta, \delta, M$)). Given a regular graph $G = (V, E)$, distinguish between the
following two cases:

1. Yes: Some subset $S \subseteq V$ with $|S| = \delta n$ has $\Phi_G(S) \le \eta$

2. No: Any set $S \subseteq V$ with $|S| \le M \delta n$ has $\Phi_G(S) \ge 1 - \eta$

Footnote 2: This formulation comes from the full version of the paper on Prasad Raghavendra’s homepage. This formulation has a different soundness condition than the one in the conference version of [46]. Furthermore, [48] shows that the two formulations are equivalent.

The following reduction in [48, Proposition 5.8] shows that the two versions of Small-Set 
Expansion are equivalent. 


Claim 4.4. For all $\eta, \delta > 0$, $M \ge 1$, there is a polynomial time reduction from SSE($\eta/M, \delta$) to
SSE($\eta, \delta, M$).


We note that our statement is slightly different from [48, Proposition 5.8], due to our different 
version of Small-Set Expansion Hypothesis, but the proof of the above claim is the same. 

We will use the following lemma from [44]. Here the lazy random walk $G_{\mathrm{lazy}}$ corresponds to
staying at the current vertex with probability 1/2, otherwise moving according to the probability
transition matrix $G$. Therefore the probability transition matrix is given by $G_{\mathrm{lazy}} = (I + G)/2$.
For any $t \in \mathbb{N}$, define the $t$-step lazy random walk as $G^t_{\mathrm{lazy}} = (G_{\mathrm{lazy}})^t$, and let $G^t_{\mathrm{lazy}}$ also denote the
corresponding graph.


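Lemma 4.5. [44, Lemma 13] For all $t \in \mathbb{N}$ and $\eta, \delta \in (0,1]$,

$$\Phi_{G^t_{\mathrm{lazy}}}(\delta) \ge \min\left\{ 1 - \left(1 - \frac{\Phi_G(4\delta/\eta)^2}{32}\right)^{t},\ 1 - \eta \right\}.$$
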
We define PSD-SSE($\eta, \delta$) as the special case of Problem 4.1 where the adjacency matrix of
the graph is positive semidefinite. We now show that this special case is again equivalent to the
general case.


Theorem 4.6. For all $\eta, \delta > 0$, there is $\eta' > 0$ such that SSE($\eta', \delta$) is polynomial-time reducible
to PSD-SSE($\eta, \delta$).


Proof. Fix $\eta, \delta > 0$. Thanks to Claim 4.4, it suffices to reduce from Problem 4.3. That is, we
will show that there are $\eta' > 0$ and $M > 1$ such that SSE($\eta', \delta, M$) is polynomial-time reducible
to PSD-SSE($\eta, \delta$).

We will assume $\eta \le 1/2$ (if PSD-SSE($\eta, \delta$) is hard then so is the same problem with larger
$\eta$). Let $t = 128 \log(1/\eta)$, $\eta' = \min(\eta, 2\eta/t)$, $M = 4/\eta$.

The reduction takes an instance $G$ of SSE($\eta', \delta, M$) and outputs $G^t_{\mathrm{lazy}}$. The lazy random
walk matrix $G_{\mathrm{lazy}}$ is positive semidefinite, and hence so is $G^t_{\mathrm{lazy}}$. As a result, the output is an
instance of PSD-SSE.

Yes case: By [44, Lemma 12], for every subset $S$, $\Phi_{G^t_{\mathrm{lazy}}}(S) \le t\, \Phi_G(S)/2$. In particular, if
$G$ is a Yes case of SSE($\eta', \delta, M$), then some subset $S$ of size $\delta n$ has $\Phi_G(S) \le \eta'$, and thus
also $\Phi_{G^t_{\mathrm{lazy}}}(S) \le \eta$.

No case: This follows from Lemma 4.5. Indeed, $\Phi_G(4\delta/\eta) \ge 1 - \eta' \ge 1 - \eta \ge 1/2$ by
assumptions. Thus

$$\left(1 - \frac{\Phi_G(4\delta/\eta)^2}{32}\right)^{t} \le \exp(-t/128) \le \eta.$$

By Lemma 4.5, $\Phi_{G^t_{\mathrm{lazy}}}(\delta) \ge 1 - \eta$, and $G^t_{\mathrm{lazy}}$ is a No instance of PSD-SSE($\eta, \delta$). □

Let us mention that a variant of the previous lemma follows from the techniques of [20, 37], 
and in fact without making the graph lazy at all. 

Given a PSD matrix $A$ of size $n$, let us define the sparse PCA objective
$\mathrm{Val}_A(\delta) = \max_{\|x\|_2 = 1,\ \|x\|_0 \le \delta n} x^\top A x$.

We also need the local version of Cheeger-Alon-Milman inequality [42, Theorem 1.7]. 






Lemma 4.7. Let $L = I - G$ be the normalized Laplacian matrix of a regular graph $G$ on $n$
vertices. For any $\delta \le 1/2$, let $\lambda_\delta = \min\{x^\top L x / x^\top x \mid \|x\|_0 \le \delta n\}$. Then

$$\Phi_G(\delta) \le \sqrt{(2 - \lambda_\delta)\, \lambda_\delta}.$$


Theorem 4.8. If $G$ is a Yes instance of PSD-SSE($\eta, \delta$), then $\mathrm{Val}_G(\delta) \ge 1 - \eta$. If $G$ is a No
instance of PSD-SSE($\eta, \delta$), then $\mathrm{Val}_G(\delta) \le \sqrt{1 - (1 - \eta)^2}$.

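Proof. Yes case: Let $S$ be a subset with $|S| \le \delta n$ and $\Phi_G(S) \le \eta$. Consider the normalized
indicator function $\mathbf{1}_S : V \to \mathbb{R}$ for $S$. $\mathbf{1}_S$ has at most $\delta n$ non-zero entries, and by normalization,
$\|\mathbf{1}_S\|_2 = 1$. Furthermore,

$$\mathbf{1}_S^\top G\, \mathbf{1}_S = \frac{\sum_{i \in S} \left(1 - \sum_{j \notin S} G_{ij}\right)}{|S|} = 1 - \frac{\sum_{i \in S,\, j \notin S} G_{ij}}{|S|} = 1 - \Phi_G(S).$$

Therefore $\mathrm{Val}_G(\delta) \ge 1 - \eta$.

No case: Let $x$ be any $\delta n$-sparse vector. Then

$$\frac{x^\top G x}{x^\top x} = 1 - \frac{x^\top L x}{x^\top x} \le 1 - \lambda_\delta,$$

where $\lambda_\delta$ is as defined in Lemma 4.7 and satisfies

$$\sqrt{(2 - \lambda_\delta)\, \lambda_\delta} \ge \Phi_G(\delta) \ge 1 - \eta.$$

Letting $\rho = 1 - \lambda_\delta$, the previous inequality becomes $1 - \rho^2 = (1 + \rho)(1 - \rho) \ge (1 - \eta)^2$, and
hence $x^\top G x / x^\top x \le \rho \le \sqrt{1 - (1 - \eta)^2}$. □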

Theorem 4.6 implies SparsePCA is hard to solve within any constant factor $C$. Indeed, let
$\eta = \min\left(1 - \sqrt{1 - 1/4C^2},\ 1/2\right)$. Theorems 4.6 and 4.8 and Conjecture 4.2 imply that given the
matrix $G$ in the output of Theorem 4.8, it is NP-hard to tell whether $\mathrm{Val}_G(\delta) \ge 1 - \eta \ge 1/2$,
or $\mathrm{Val}_G(\delta) \le \sqrt{1 - (1 - \eta)^2} = 1/2C$.
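
A small numerical illustration of the quantities used in this section may help. The sketch below uses a hypothetical example of our choosing (the simple random walk on a 12-cycle and an arc of four vertices); it checks the Yes-case identity $\mathbf{1}_S^\top G \mathbf{1}_S = 1 - \Phi_G(S)$, confirms that the lazy walk matrix $(I + G)/2$ has no negative eigenvalues, and evaluates the No-case bound for the choice of $\eta$ above.

```python
import numpy as np

# Transition matrix of the simple random walk on a cycle with n vertices (1-regular weights).
n = 12
G = np.zeros((n, n))
for i in range(n):
    G[i, (i + 1) % n] = 0.5
    G[i, (i - 1) % n] = 0.5

def expansion(G, S):
    """Phi_G(S): total edge weight leaving S, divided by |S|."""
    S = np.array(sorted(S))
    T = np.setdiff1d(np.arange(G.shape[0]), S)
    return G[np.ix_(S, T)].sum() / len(S)

S = [0, 1, 2, 3]                           # an arc of the cycle
ind = np.zeros(n)
ind[S] = 1 / np.sqrt(len(S))               # normalized indicator of S
quad = ind @ G @ ind                       # quadratic form in the Yes case
phi = expansion(G, S)

G_lazy = (np.eye(n) + G) / 2               # lazy walk: eigenvalues lie in [0, 1]
min_eig_lazy = np.linalg.eigvalsh(G_lazy).min()

C = 3.0                                    # target constant factor (arbitrary choice)
eta = min(1 - np.sqrt(1 - 1 / (4 * C**2)), 0.5)
no_bound = np.sqrt(1 - (1 - eta) ** 2)     # upper bound on Val_G(delta) in the No case
```

With this choice of $\eta$ the Yes value is at least $1/2$ while the No bound is $1/2C$, which is the factor-$C$ separation claimed in the text.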


5 SDP gap 

Recall the SDP for sparse PCA proposed by [24]:

$$\begin{aligned} \max\ \ & \operatorname{tr}(AX) \\ \text{such that}\ \ & \operatorname{tr}(X) = 1 \\ & \mathbf{1}^\top |X| \mathbf{1} \le k \\ & X \succeq 0 \end{aligned} \qquad (6)$$

In this section, we will show that the SDP has a factor $\exp\exp(\Omega(\sqrt{\log\log n}))$ gap.

If $A$ is the adjacency matrix of a graph, then the SDP is essentially identical to the SDP for
small-set expansion in [47]. Gap instances for the latter problem therefore imply strong rank
gap for sparse PCA, provided the adjacency matrix is PSD. A typical gap instance for the small-set
expansion SDP is the noisy hypercube of dimension $\log n$ with $n$ vertices. It is not hard to see
that its adjacency matrix leads to a $(\log n)^{\Omega(1)}$ gap for the sparse PCA SDP. Below we use a more
sophisticated graph $G$ that can be considered as a small induced subgraph of the noisy hypercube
(even though formally $G$ is not such a subgraph). This will lead to an $\exp\exp(\Omega(\sqrt{\log\log n}))$ gap
for the sparse PCA SDP, where $n$ is the number of vertices in this graph. This gap factor is
super-polylogarithmic but sub-polynomial.
Construction

The gap instance $A$ for the SDP is derived from the short code graph $G$ from [10], also known
as the low-degree long code. Its vertex set is the Reed-Muller code RM($m, d$) (evaluations of
polynomials of (total) degree $\le d$ over $\mathbb{F}_2$ in $m$ variables $x_1, \dots, x_m$). Two vertices are connected
if their corresponding polynomials differ by a product of exactly $d$ linearly independent affine
forms. Call $\mathcal{T}$ the collection of all such products of affine forms. Therefore $G$ is the Cayley graph on
RM($m, d$) with generating set $\mathcal{T}$.

The matrix $A$ will be the adjacency matrix for the continuous-time random walk on $G$. That
is, $A = e^{-t(I - G)}$ for some $t \ge 0$. Here we denote by $G$ the probability transition matrix for
the graph $G$. Therefore $G$ is a matrix where every row and every column sum to 1. As in [10],
taking a continuous-time random walk significantly reduces the value of the quadratic form for
sparse vectors. For our application, the continuous-time random walk has the additional benefit
that $A$ is guaranteed to be PSD because $A$ is the exponentiation of a real symmetric matrix.

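It will be more convenient to transform Eq. (6) into the following SDP:

$$\begin{aligned} \max\ \ & \mathbb{E}_f \langle w_f, (Aw)_f \rangle \\ \text{such that}\ \ & \mathbb{E}_f \langle w_f, w_f \rangle = 1 \\ & \mathbb{E}_{f,g} \left| \langle w_f, w_g \rangle \right| \le \delta = k/n \end{aligned} \qquad (7)$$

The SDPs in Eqs. (6) and (7) are indeed equivalent, because any SDP solution $X$ to Eq. (6) is
the (scaled) Gram matrix

$$X_{f,g} = \langle w_f, w_g \rangle / n \qquad (8)$$

of some vectors $w_f \in \mathbb{R}^n$, and vice versa.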

Choice of parameters: $m$ is a free parameter that all other parameters depend on. Let
$\delta = 1/2^{m/2}$ be the fractional sparsity parameter. Let $\eta = \delta^{1/(4\log 3)}$ be the eigenvalue threshold.
Let $\varepsilon_2 = \min\{\varepsilon_1, 1/20\}$, where $\varepsilon_1$ is the constant from [15, Theorem 1]. Let $d = \log\log(1/\eta) +
\log(1/\varepsilon_2) - 1$ be the degree of the Reed-Muller code, and let $t = 2^{d-1}$ be the time parameter
for the continuous random walk. Let $n = |\mathrm{RM}(m, d)| = 2^{\binom{m}{\le d}}$ be the size of $A$. Here $\binom{m}{\le d} =
\sum_{r \le d} \binom{m}{r}$ denotes the number of ways to choose a subset of size $\le d$ out of $m$ elements. Let
$k = n/2^{m/2}$.
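To see how these quantities fit together, here is a small numerical sketch. It rests on several assumptions that are not fixed by the text: logarithms are taken base 2, the eigenvalue threshold is read as $\eta = \delta^{1/(4\log 3)}$, and `eps2 = 1/20` is only a placeholder, since the true $\varepsilon_2$ also depends on the constant of [15, Theorem 1]. The function name `gap_parameters` is hypothetical.

```python
import math

def gap_parameters(m, eps2=1 / 20):
    """Parameters of the gap instance as functions of m (base-2 logs, an assumption)."""
    log3 = math.log2(3)
    delta = 2.0 ** (-m / 2)                     # fractional sparsity k/n
    eta = delta ** (1 / (4 * log3))             # eigenvalue threshold
    d = math.log2(math.log2(1 / eta)) + math.log2(1 / eps2) - 1
    t = 2.0 ** (d - 1)                          # time of the continuous walk
    ell = eps2 * 2.0 ** (d + 1)                 # character degree bound used in the proof
    # Rank-1 bound of the form eta + (1/eta)^{log 3} * sqrt(k/n).
    rank1_bound = eta + (1 / eta) ** log3 * math.sqrt(delta)
    return {"delta": delta, "eta": eta, "d": d, "t": t,
            "ell": ell, "rank1_bound": rank1_bound}

params = gap_parameters(64)
for name, value in params.items():
    print(f"{name} = {value:.6g}")
```

Under these assumptions the degree bound satisfies $\varepsilon_2 2^{d+1} = \log(1/\eta)$, and the rank-1 bound simplifies to $\eta + \delta^{1/4}$; both terms are polynomially small in $\delta = k/n$. For finite $m$ the value of $d$ is generally not an integer; the parameter choice is asymptotic.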

Proposition 5.1. The SDP in Eq. (7) has a solution of value $1/e = \Omega(1)$.

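Proof. Let $w_f$ be the standard embedding of $f \in \mathrm{RM}(m,d)$. That is, $w_f : \mathbb{F}_2^m \to \mathbb{R}$ is the
vector/function whose $x$-coordinate is $w_f(x) = (-1)^{f(x)} \in \{\pm 1\}$ for $x \in \mathbb{F}_2^m$. This defines
a solution $X$ through Eq. (8). In Eq. (8) and below, the inner product $\langle \cdot, \cdot \rangle$ on functions $\mathbb{F}_2^m \to \mathbb{R}$ is defined as
$\langle w, w' \rangle = \mathbb{E}_{x \in \mathbb{F}_2^m} w(x) w'(x)$.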

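We now verify that $X$ is a feasible solution to the SDP. As a Gram matrix, $X$ is clearly
PSD. Also

$$ \mathbb{E}_{f \in \mathrm{RM}(m,d)} \langle w_f, w_f \rangle = \mathbb{E}_{f \in \mathrm{RM}(m,d)} \mathbb{E}_{x \in \mathbb{F}_2^m} \big[ ((-1)^{f(x)})^2 \big] = 1, $$

and

$$ \mathbb{E}_{f,g \in \mathrm{RM}(m,d)} |\langle w_f, w_g \rangle| = \mathbb{E}_{f,g \in \mathrm{RM}(m,d)} \big| \mathbb{E}_{x \in \mathbb{F}_2^m} (-1)^{f(x) - g(x)} \big| = \mathbb{E}_{h \in \mathrm{RM}(m,d)} \big| \mathbb{E}_{x \in \mathbb{F}_2^m} (-1)^{h(x)} \big|, \tag{9} $$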

where in the last equality we let $h = f - g$. Using Cauchy-Schwarz, the right-hand side is at
most

$$ \sqrt{\mathbb{E}_h \big( \mathbb{E}_{x \in \mathbb{F}_2^m} (-1)^{h(x)} \big)^2} = \sqrt{\mathbb{E}_{x,y \in \mathbb{F}_2^m} \mathbb{E}_h (-1)^{h(x) - h(y)}}. \tag{10} $$

We now analyze the term inside the square root. When $x \ne y$,

$$ \mathbb{E}_h (-1)^{h(x) - h(y)} = 0, $$

thanks to pairwise independence of $\mathrm{RM}(m, d)$. When $x = y$ (which happens with probability
$1/2^m$), the same expectation is 1. Therefore Eq. (10) is at most $1/2^{m/2}$, and so is Eq. (9). Then
$X$ satisfies the sparsity constraint with $k/n = 1/2^{m/2}$.
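The two feasibility conditions can be checked by brute force on a tiny code. The sketch below is illustrative only: it uses $\mathrm{RM}(3, 1)$ (all affine functions on $\mathbb{F}_2^3$), and the variable names are hypothetical.

```python
import itertools
import numpy as np

m = 3
points = list(itertools.product([0, 1], repeat=m))       # all x in F_2^m

# Codewords of RM(m, 1): truth tables of b + <a, x> over F_2.
codewords = np.array([[(b + sum(ai * xi for ai, xi in zip(a, x))) % 2 for x in points]
                      for a in points for b in (0, 1)])

W = (-1.0) ** codewords                 # row f is the embedding w_f(x) = (-1)^{f(x)}
gram = W @ W.T / len(points)            # <w_f, w_g> = E_x w_f(x) w_g(x)

trace_term = np.mean(np.diag(gram))     # E_f <w_f, w_f>
l1_term = np.mean(np.abs(gram))         # E_{f,g} |<w_f, w_g>|
bound = 2.0 ** (-m / 2)                 # the Cauchy-Schwarz bound 1/2^{m/2}

print(trace_term, l1_term, bound)
```

In this toy case $\langle w_f, w_g \rangle$ is $\pm 1$ when $f - g$ is constant and $0$ otherwise, so the sparsity term equals $2/16 = 1/2^m$, well below the bound $1/2^{m/2}$ obtained from Cauchy-Schwarz.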

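We now bound the SDP value. Let $\varphi_x(f) = (-1)^{f(x)}$. Then

$$ \mathbb{E}_f \langle w_f, (Aw)_f \rangle = \mathbb{E}_f \mathbb{E}_{x \in \mathbb{F}_2^m} \varphi_x(f) \, (A\varphi_x)(f). $$

We claim that $\varphi_x$ is an eigenfunction of $A$ with eigenvalue $1/e$. Assuming this claim, the
right-hand side becomes

$$ (1/e) \cdot \mathbb{E}_f \mathbb{E}_{x \in \mathbb{F}_2^m} \big[ (\varphi_x(f))^2 \big] = 1/e, $$

giving an SDP solution of value $1/e$.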

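We now verify the claim. For every $x \in \mathbb{F}_2^m$, the function $\varphi_x(f) = (-1)^{f(x)}$ is an eigenvector
of $G$ because

$$ (G\varphi_x)(f) = \mathbb{E}_{g \in T} (-1)^{f(x) - g(x)} = \varphi_x(f) \cdot \mathbb{E}_{g \in T} (-1)^{g(x)}. $$

It has eigenvalue

$$ \lambda_x \triangleq \mathbb{E}_{g \in T} (-1)^{g(x)} = 1 - 2 \Pr_{g \in T}[g(x) = 1] = 1 - 2^{1-d}. $$

Since $G$ and $A$ have the same eigenvectors, $\varphi_x$ is also an eigenvector of $A$ with eigenvalue

$$ e^{-t(1 - \lambda_x)} = e^{-t 2^{1-d}} = 1/e. $$ □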
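The eigenvalue computation hinges on the fact that a product of $d$ linearly independent affine forms equals 1 at a fixed point with probability $2^{-d}$. The following sketch checks this by exact enumeration for $m = 3$, $d = 2$. It assumes the generator is sampled by drawing an ordered pair of linearly independent affine forms uniformly, which is a simplification of "uniform over $T$"; the helper `affine` is hypothetical.

```python
import itertools
import math
from fractions import Fraction

m, d = 3, 2
vectors = list(itertools.product([0, 1], repeat=m))
nonzero = [a for a in vectors if any(a)]
x = (1, 0, 1)                                    # any fixed evaluation point

def affine(a, b, x):
    """Value of the affine form b + <a, x> over F_2 (hypothetical helper)."""
    return (b + sum(ai * xi for ai, xi in zip(a, x))) % 2

hits = total = 0
# Two vectors over F_2 are linearly independent iff both are nonzero and distinct.
for a1, a2 in itertools.permutations(nonzero, 2):
    for b1, b2 in itertools.product([0, 1], repeat=2):
        total += 1
        hits += affine(a1, b1, x) * affine(a2, b2, x)   # product of the two forms at x

p = Fraction(hits, total)                        # Pr[g(x) = 1]
lam = 1 - 2 * p                                  # eigenvalue of G
t = 2 ** (d - 1)                                 # time parameter
eig_A = math.exp(-t * (1 - float(lam)))          # eigenvalue of A = exp(-t (I - G))

print(p, lam, eig_A)
```

With $t = 2^{d-1}$ the exponent $t(1 - \lambda_x) = 2^{d-1} \cdot 2^{1-d}$ equals 1 for every $d$, which is what makes the SDP value exactly $1/e$.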

Proposition 5.2. Any $k$-sparse rank-1 solution $w : \mathrm{RM}(m,d) \to \mathbb{R}$ to Eq. (7) has value
$\le \eta + (1/\eta)^{\log 3} \sqrt{k/n}$.

Since the proof is quite technical, let us recall the main ideas in [10]. Intuitively, the sparse
PCA instance $A$ has low value for rank-1 sparse vectors for the following reason. The inner
product space of functions $V(G) \to \mathbb{R}$ can be decomposed into a sum of the subspace $V_\ell$ and its orthogonal
complement $V_\ell^\perp$. One can show that $V_\ell$ does not contain any sparse vector (more precisely, has
bounded 2-to-4 norm). Therefore any sparse vector must be essentially contained in (i.e. has
large projection to) $V_\ell^\perp$. $V_\ell^\perp$ will be the span of eigenvectors of $A$ whose eigenvalues are small,
say at most a small positive number $\eta$. This ensures all sparse vectors have small objective
value under the quadratic form, as desired.

Proof. This is essentially Theorem 4.14 in [10]. Even though their statement only concerns
$\{0, 1\}$-valued sparse vectors, their proof also works for real-valued sparse vectors, as we now
show.

Setting up the Fourier expansion 

Let $M = 2^m$ and $C = \mathrm{RM}(m,d)$. We first think of the elements of $C$ as functions $\mathbb{F}_2^m \to \mathbb{F}_2$;
later it will be more convenient to think of them as vectors in $\mathbb{F}_2^M$. For $c_1, c_2 : \mathbb{F}_2^m \to \mathbb{F}_2$ denote
the inner product $\langle c_1, c_2 \rangle_2 = \sum_{x \in \mathbb{F}_2^m} c_1(x) c_2(x) \pmod 2$.

Denote by $C^\perp = \{\alpha \in \mathbb{F}_2^M \mid \langle \alpha, c \rangle_2 = 0 \text{ for all } c \in C\}$ the orthogonal subspace of $C$.

Any function $w : C \to \mathbb{R}$ has a Fourier expansion, as follows. For every coset $\alpha + C^\perp \in
\mathbb{F}_2^M / C^\perp$, we choose an arbitrary representative $\alpha$ in $\alpha + C^\perp$, and let $\chi_\alpha(f) = (-1)^{\langle \alpha, f \rangle_2}$ be its
character. Its degree is $\deg_{\mathbb{R}}(\chi_\alpha) = \min_{c^\perp \in C^\perp} |\alpha + c^\perp|$, where $|\alpha|$ denotes the Hamming weight
(i.e. number of non-zero coordinates) of $\alpha$. (Do not confuse this degree with the degrees of
polynomials in the Reed-Muller code!) Any function $w : C \to \mathbb{R}$ is a unique linear combination
of characters $\{\chi_\alpha\}_{\alpha \in \mathbb{F}_2^M / C^\perp}$,

$$ w(f) = \sum_{\alpha \in \mathbb{F}_2^M / C^\perp} \hat{w}(\alpha) \chi_\alpha(f), $$

where $\hat{w}(\alpha) = \langle \chi_\alpha, w \rangle$ is the Fourier transform of $w$ over the abelian group $C$.
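The characters and their degrees can be enumerated explicitly for a small code. The sketch below is illustrative only: it takes $C = \mathrm{RM}(3, 1)$, groups every $\alpha \in \mathbb{F}_2^8$ by the character it induces on $C$ (equivalently, by its coset of $C^\perp$), records the minimum Hamming weight in each group as the degree, and then checks the expansion on a random function. All variable names are hypothetical.

```python
import itertools
from collections import Counter
import numpy as np

m = 3
points = list(itertools.product([0, 1], repeat=m))
M = len(points)                                          # M = 2^m coordinates

# C = RM(m, 1): truth tables of all affine functions, one per row (16 x 8).
C = np.array([[(b + sum(ai * xi for ai, xi in zip(a, x))) % 2 for x in points]
              for a in points for b in (0, 1)])

# Group every alpha by the character f -> (-1)^{<alpha, f>_2} it induces on C.
# The smallest Hamming weight within a group is the degree of that character.
degree = {}
for alpha in itertools.product([0, 1], repeat=M):
    chi = tuple(int(v) for v in (-1) ** (C @ np.array(alpha) % 2))
    degree[chi] = min(sum(alpha), degree.get(chi, M))

chars = np.array(list(degree.keys()), dtype=float)       # one row per character
gram = chars @ chars.T / len(C)                          # <chi_a, chi_b> = E_f chi_a(f) chi_b(f)

# Fourier expansion of a random function w : C -> R and its reconstruction.
rng = np.random.default_rng(0)
w = rng.standard_normal(len(C))
w_hat = chars @ w / len(C)                               # coefficients <chi_alpha, w>
w_rebuilt = chars.T @ w_hat

print("number of characters:", len(degree))
print("degree histogram:", dict(Counter(degree.values())))
```

There are exactly $|C|$ distinct characters, one per coset, and they form an orthonormal basis under the normalized inner product, which is why the reconstruction recovers $w$.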

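Set the character degree bound $\ell = \varepsilon_2 2^{d+1}$. Consider the subspace $V_\ell = \mathrm{span}\{\chi_\alpha \mid
\deg_{\mathbb{R}}(\chi_\alpha) \le \ell\}$ of functions of degree at most $\ell$. Note that $V_\ell$ and $V_\ell^\perp$ are both invariant
subspaces of $A$.

Given any vector $w$, we expand it as $w = w^\parallel + w^\perp$ where $w^\parallel \in V_\ell$ and $w^\perp \in V_\ell^\perp$. Then

$$ \langle w, Aw \rangle = \langle w^\parallel, Aw^\parallel \rangle + \langle w^\perp, Aw^\perp \rangle. \tag{11} $$

Below, we separately bound the contributions of $\langle w^\parallel, Aw^\parallel \rangle$ and $\langle w^\perp, Aw^\perp \rangle$.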


The low-degree subspace $V_\ell$

Consider the projection operator $P_\ell$ to the subspace $V_\ell$. The $p$-to-$q$ norm of $P_\ell$ is defined as

$$ \|P_\ell\|_{p \to q} = \max_{w : C \to \mathbb{R}} \frac{\|P_\ell w\|_q}{\|w\|_p}, $$

where in the case of a function $w : C \to \mathbb{R}$, we define $\|w\|_p = \mathbb{E}_{x \in C}[|w(x)|^p]^{1/p}$.

We use the following bound on the 2-to-4 norm of $P_\ell$, from [10, Lemma 4.9]: For any
$\ell < (2^{d-1} - 1)/4$,

$$ \|P_\ell\|_{2 \to 4} \le 3^{\ell/2}. \tag{12} $$

For any $k$-sparse vector $w : C \to \mathbb{R}$, let $S = \{x \in C \mid w(x) \ne 0\}$ be the set of nonzero
entries. By Hölder's inequality,

$$ \|w\|_{4/3} = \|\mathbf{1}_S \cdot w\|_{4/3} \le \|\mathbf{1}_S\|_4 \|w\|_2 = (k/n)^{1/4} \|w\|_2. \tag{13} $$
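A quick numerical sanity check of this Hölder step, using the expectation-normalized norms defined above. This is only a sketch; the sizes `n`, `k` and the helper `norm_p` are hypothetical choices.

```python
import numpy as np

def norm_p(w, p):
    """Expectation-normalized p-norm: (E_x |w(x)|^p)^(1/p)."""
    return np.mean(np.abs(w) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
n, k = 1024, 16

# A random k-sparse vector.
w = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
w[support] = rng.standard_normal(k)
lhs = norm_p(w, 4 / 3)
rhs = (k / n) ** 0.25 * norm_p(w, 2)

# A flat k-sparse vector, for which the inequality is tight.
flat = np.zeros(n)
flat[support] = 1.0
flat_lhs = norm_p(flat, 4 / 3)
flat_rhs = (k / n) ** 0.25 * norm_p(flat, 2)

print(lhs, rhs, flat_lhs, flat_rhs)
```

The flat vector shows that the factor $(k/n)^{1/4}$ cannot be improved: both sides equal $(k/n)^{3/4}$ in that case.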


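Recall that $A = e^{-t(I - G)}$. Since $(I - G)$ is PSD, we have that all of $A$'s eigenvalues are at
most 1, i.e. $I \succeq A$. Therefore,

$$ \langle w^\parallel, Aw^\parallel \rangle \le \|w^\parallel\|_2^2 \le \|P_\ell\|_{4/3 \to 2}^2 \|w\|_{4/3}^2. $$

Together with $\|P_\ell\|_{4/3 \to 2} \le \|P_\ell\|_{2 \to 4}$ [10, Lemma 4.2] and Eqs. (12) and (13), we get

$$ \langle w^\parallel, Aw^\parallel \rangle \le 3^\ell \sqrt{k/n} \, \|w\|_2^2 = (1/\eta)^{\log 3} \sqrt{k/n} \, \|w\|_2^2, \tag{14} $$

where the last equation follows from $3^\ell = 3^{\varepsilon_2 2^{d+1}} = 3^{\log(1/\eta)} = (1/\eta)^{\log 3}$.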


The high-degree subspace $V_\ell^\perp$

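We now bound the second term $\langle w^\perp, Aw^\perp \rangle$. Here $w^\perp$ is a linear combination of characters of
degree $> \ell$. Recall that $T$, the generating set of $G$, is the set of products of exactly $d$ linearly
independent affine forms. Any character $\chi_\alpha$ is an eigenvector of $G$ because

$$ (G\chi_\alpha)(f) = \mathbb{E}_{g \in T}[\chi_\alpha(f + g)] = \mathbb{E}_{g \in T}\big[(-1)^{\langle \alpha, f + g \rangle_2}\big] = (-1)^{\langle \alpha, f \rangle_2} \, \mathbb{E}_{g \in T}\big[(-1)^{\langle \alpha, g \rangle_2}\big] = \chi_\alpha(f) \, \mathbb{E}_{g \in T} \chi_\alpha(g), $$

and its eigenvalue is

$$ \lambda_\alpha \triangleq \mathbb{E}_{g \in T} \chi_\alpha(g) = \mathbb{E}_{g \in T}\big[(-1)^{\langle \alpha, g \rangle_2}\big] = 1 - 2 \Pr_{g \in T}[\langle \alpha, g \rangle_2 = 1]. $$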

We now use a theorem about Reed-Muller code testers to bound $\Pr_{g \in T}[\langle \alpha, g \rangle_2 = 1]$. An
important problem in the intersection of coding theory and property testing is as follows: given
a code $C^\perp$ and a word $\alpha$, query a small number of $\alpha$'s bits to decide whether $\alpha$ belongs to the
code, or is far from the code. By “far” from the code, it is meant that it has a large Hamming
distance from any $c^\perp \in C^\perp$. When $C^\perp$ is a Reed-Muller code, in particular $\mathrm{RM}(m, m - d - 1)$,
this is equivalent to testing whether $\alpha$ is a low-degree (degree $\le m - d - 1$) polynomial, or far from every
low-degree polynomial. A canonical test for this problem is as follows: pick a random $(m - d)$-dimensional
affine subspace $S_g$, and test whether $\alpha$ restricted to this subspace is a degree-$(m - d - 1)$
polynomial.

It turns out that having degree $\le m - d - 1$ over $S_g$ corresponds exactly to having $\sum_{x \in S_g} \alpha(x) =
0 \pmod 2$ [15]. (Proof sketch: any monomial of degree $\le m - d - 1$ does not contain at least
one of the $m - d$ variables, and thus zeros out when we sum modulo 2 over that variable; in the
other direction, there is only one homogeneous full-degree monomial, and it is nonzero only on
the all-ones input.)

Furthermore, picking a random $(m - d)$-dimensional affine subspace $S_g$ corresponds precisely
to picking a random $g \in T$ and letting $S_g = \{x : g(x) = 1\}$. (This is related to “dual codes”;
see also [2].) In other words, the test is the same as verifying that $\langle \alpha, g \rangle_2 = 0$.
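The test can be run exhaustively in a toy setting. The sketch below assumes $m = 3$ and $d = 1$, so the generators are the non-constant affine forms, each $S_g$ is a 2-dimensional affine subspace, and the code being tested is $\mathrm{RM}(3, 1)$. The helper names are hypothetical.

```python
import itertools

m = 3
points = list(itertools.product([0, 1], repeat=m))

def affine_table(a, b):
    """Truth table of the affine form b + <a, x> over F_2 (hypothetical helper)."""
    return [(b + sum(ai * xi for ai, xi in zip(a, x))) % 2 for x in points]

# Generators for d = 1: all non-constant affine forms g, with S_g = {x : g(x) = 1}.
T = [affine_table(a, b) for a in points if any(a) for b in (0, 1)]

def rejects(alpha, g):
    """The test rejects iff alpha sums to 1 (mod 2) over S_g, i.e. <alpha, g>_2 = 1."""
    return sum(al * gx for al, gx in zip(alpha, g)) % 2 == 1

def rejection_count(alpha):
    return sum(rejects(alpha, g) for g in T)

in_code = affine_table((1, 0, 1), 1)        # degree 1 = m - d - 1, so a codeword
far = [x[0] * x[1] for x in points]         # x1 * x2 has degree 2, so not a codeword

print("codeword rejected by", rejection_count(in_code), "of", len(T), "tests")
print("x1*x2 rejected by", rejection_count(far), "of", len(T), "tests")
```

A codeword is never rejected, while the degree-2 word is rejected by a constant fraction of the tests; the eigenvalue $\lambda_\alpha = 1 - 2\Pr[\text{reject}]$ is therefore bounded away from 1 for words outside the code.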

Bhattacharyya et al. [15] analyze the probability that the above test rejects polynomials
that are far from the code, i.e. precisely the quantity $\Pr_{g \in T}[\langle \alpha, g \rangle_2 = 1]$. Recall that the degree
of $\chi_\alpha$ was defined as the Hamming distance of $\alpha$ from $C^\perp$. By our assumption that $\chi_\alpha \in V_\ell^\perp$,
we have that $\deg_{\mathbb{R}}(\chi_\alpha) \ge \ell = \varepsilon_2 2^{d+1}$; that is, $\alpha$ disagrees with every $c^\perp \in C^\perp$ on at least an
$\varepsilon_2 2^{d+1} / 2^m = \varepsilon_2 2^{-(m-d-1)}$ fraction of the entries. Therefore, by [15, Theorem 1], we have that
$\Pr_{g \in T}[\langle \alpha, g \rangle_2 = 1] \ge \varepsilon_2$.

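As a result, any $\chi_\alpha$ with $\deg_{\mathbb{R}}(\chi_\alpha) \ge \varepsilon_2 2^{d+1}$ is also an eigenvector of $A$ with eigenvalue

$$ \rho_\alpha \triangleq e^{-t(1 - \lambda_\alpha)} \le e^{-\varepsilon_2 2^{d}} \le \eta. $$

Therefore

$$ \langle w^\perp, Aw^\perp \rangle = \sum_{\substack{\alpha \in \mathbb{F}_2^M / C^\perp \\ \deg_{\mathbb{R}}(\chi_\alpha) > \ell}} \rho_\alpha \hat{w}(\alpha)^2 \le \eta \sum_{\substack{\alpha \in \mathbb{F}_2^M / C^\perp \\ \deg_{\mathbb{R}}(\chi_\alpha) > \ell}} \hat{w}(\alpha)^2 = \eta \|w^\perp\|_2^2 \le \eta \|w\|_2^2. \tag{15} $$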


Finally, Proposition 5.2 follows from Eqs. (11), (14) and (15) and the constraint $\|w\|_2^2 \le 1$.

□ 

We remark that an alternative proof of the previous proposition (with a slightly different
bound) can be obtained by combining Theorem 4.14 in [10] with the local Cheeger-Alon-Milman
inequality [42, Theorem 1.7].

Theorem 5.3. Let $A$ be the matrix defined above. The SDP in Eq. (7) has an SDP solution
of value $\Omega(1)$, but any rank-1 solution has value at most $1/\exp\exp(\Omega(\sqrt{\log\log n}))$.

Proof. The SDP solution is given in Proposition 5.1. On the other hand, Proposition 5.2
shows that any rank-1 solution has value at most $(k/n)^{\Omega(1)}$. Since $\log n = \binom{m}{\le d}$, we have $\log\log n =
(\log m)^2 (1 + o_m(1))$ and $(k/n)^{\Omega(1)} = 1/\exp(\Omega(m)) = 1/\exp\exp(\Omega(\sqrt{\log\log n}))$. □
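The asymptotics in the last step can be unpacked as follows. This is a sketch rather than part of the original proof: it assumes base-2 logarithms, reads the eigenvalue threshold as $\eta = \delta^{1/(4\log 3)}$ with $\delta = k/n = 2^{-m/2}$, and does not track constants.

```latex
\begin{align*}
\log(1/\eta) &= \frac{\log(1/\delta)}{4\log 3} = \frac{m}{8\log 3} = \Theta(m), \\
d &= \log\log(1/\eta) + \log(1/\varepsilon_2) - 1 = \log m + O(1), \\
\log\log n &= \log\binom{m}{\le d} = d\,\bigl(\log m - \log d + O(1)\bigr)
            = (\log m)^2\,(1 + o_m(1)), \\
m &= 2^{\log m} = \exp\bigl(\Theta(\sqrt{\log\log n})\bigr), \\
(k/n)^{\Omega(1)} &= 2^{-\Omega(m)} = 1/\exp\exp\bigl(\Omega(\sqrt{\log\log n})\bigr).
\end{align*}
```

The third line uses the standard estimates $(m/d)^d \le \binom{m}{\le d} \le (em/d)^d$ together with $\log d = O(\log\log m) = o(\log m)$.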

6 Additive PTAS 

To complete the approximability picture for SparsePCA, we briefly sketch the proof of the 
additive PTAS due to [8]. The algorithm first approximates A with a low-rank sketch, and then 
finds approximate solutions via an $\varepsilon$-net search of the low-dimensional space. (We note that a
similar approach was previously presented in [3], for the closely related problem of DkS on a 
PSD adjacency matrix.) 

The existence of a low-rank sketch, due to Alon et al., is via an application of the Johnson-Lindenstrauss
Lemma:

Lemma 6.1 ([3]). Let $A \in \mathbb{R}^{n \times n}$ be a PSD matrix with entries in $[-1, 1]$. Then, we can construct
in polynomial time a PSD matrix $A_\varepsilon$ with rank $O\big(\frac{\log n}{\varepsilon^2}\big)$ such that

$$ |[A]_{ij} - [A_\varepsilon]_{ij}| \le \varepsilon $$

for all $i, j$ with high probability.

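The above low-rank approximation to $A$ preserves all $k$-sparse quadratic forms to within an
additive error term:

$$ |x^T A x - x^T A_\varepsilon x| = \Big| \sum_{i,j} x_i x_j \big([A]_{ij} - [A_\varepsilon]_{ij}\big) \Big| \le \varepsilon \sum_{i=1}^{n} \sum_{j=1}^{n} |x_i x_j|. \tag{16} $$

Since $\sum_{i,j} |x_i x_j| = \|x\|_1^2 \le k$ for any $k$-sparse unit vector $x$ (by Cauchy-Schwarz), the additive
error is at most $\varepsilon k$.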
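The following sketch illustrates the sketch-then-bound argument numerically. It is written in the spirit of the Johnson-Lindenstrauss construction and is not necessarily the exact procedure of [3]: a Gaussian random projection is applied to a factor of $A$, and the sizes `n`, `r`, `k` are hypothetical choices. The entrywise error is measured empirically rather than guaranteed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 200, 64, 5          # matrix size, sketch rank, sparsity (hypothetical)

# A PSD matrix with entries in [-1, 1]: the Gram matrix of unit vectors (columns of B).
B = rng.standard_normal((n, n))
B /= np.linalg.norm(B, axis=0)
A = B.T @ B

# Gaussian random projection of the factor; A_eps is PSD with rank at most r.
R = rng.standard_normal((r, n)) / np.sqrt(r)
B_eps = R @ B
A_eps = B_eps.T @ B_eps

eps_hat = np.max(np.abs(A - A_eps))      # realized entrywise error

# A random k-sparse unit vector.
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)
x /= np.linalg.norm(x)

gap = abs(x @ A @ x - x @ A_eps @ x)
print("entrywise error:", eps_hat)
print("quadratic-form gap:", gap, "<= eps * k =", eps_hat * k)
```

Whatever the realized entrywise error turns out to be, the quadratic-form gap for a $k$-sparse unit vector is at most $k$ times that error, which is exactly the content of Eq. (16).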


Since $A$ is PSD, one can rewrite $A = B^T B$, where $A$'s low-rank property translates to $B$
having few rows. Enumerating over an $\varepsilon$-net on the low-dimensional side of $B$ now gives
the following:


Lemma 6.2 ([8]). Let $A_d \in \mathbb{R}^{n \times n}$ be a PSD matrix of rank $d$. Then, we can construct a vector
$x_d$, in time $O(\varepsilon^{-d} \cdot n \log n)$, such that

$$ x_d^T A_d x_d \ge (1 - \varepsilon) \cdot \mathrm{OPT}. $$
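The net enumeration can be sketched for the simplest case of rank 2. This is an illustration under stated assumptions and not necessarily the algorithm of [8]: the net is a grid of angles on the half circle, each net direction $c$ yields a candidate by keeping the $k$ largest entries of $B^T c$ in absolute value, and the function names `sparse_pca_rank2` and `brute_force` are hypothetical.

```python
import itertools
import numpy as np

def sparse_pca_rank2(B, k, num_angles=200):
    """B is 2 x n with A = B^T B. Returns (value, x) for the best candidate found."""
    best_val, best_x = -1.0, None
    for theta in np.linspace(0.0, np.pi, num_angles, endpoint=False):
        c = np.array([np.cos(theta), np.sin(theta)])   # net point on the half circle
        v = B.T @ c                                    # score of each coordinate for direction c
        support = np.argsort(-np.abs(v))[:k]           # k largest entries in absolute value
        x = np.zeros(B.shape[1])
        x[support] = v[support]
        norm = np.linalg.norm(x)
        if norm == 0:
            continue
        x /= norm
        val = float(np.sum((B @ x) ** 2))              # x^T A x = ||B x||^2
        if val > best_val:
            best_val, best_x = val, x
    return best_val, best_x

def brute_force(A, k):
    """Exact optimum: largest top eigenvalue over all k x k principal submatrices."""
    n = A.shape[0]
    return max(np.linalg.eigvalsh(A[np.ix_(S, S)])[-1]
               for S in itertools.combinations(range(n), k))

rng = np.random.default_rng(1)
B = rng.standard_normal((2, 8))
A = B.T @ B
val, x = sparse_pca_rank2(B, k=2)
opt = brute_force(A, k=2)
print("net value:", val, "optimum:", opt)
```

For each net direction the sort costs $O(n \log n)$, and a net of the unit sphere in $d$ dimensions has size exponential in $d$, which matches the shape of the running time in Lemma 6.2. If the optimal direction $Bx^*/\|Bx^*\|$ is within angle $\theta$ of a net point, the candidate for that point has value at least $\cos^2(\theta) \cdot \mathrm{OPT}$.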

Finally, combining the above two results gives the additive PTAS. 

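Theorem 6.3 ([8]). Let $A \in \mathbb{R}^{n \times n}$ be a PSD matrix with entries in $[-1, 1]$. Then, we can
compute in $n^{O(\mathrm{poly}(1/\varepsilon))}$ time a $k$-sparse unit norm vector $x_\varepsilon$ such that

$$ x_\varepsilon^T A x_\varepsilon \ge \mathrm{OPT} - \varepsilon \cdot k $$

with high probability.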

7 When the input matrix is not PSD 

In this section, we briefly remark that although the SparsePCA optimization problem can be defined
when $A$ is not required to be PSD, no meaningful multiplicative approximation guarantees
are possible (in polynomial time, assuming P $\ne$ NP).

Theorem 7.1. When $A$ is not positive semi-definite, it is NP-hard to decide whether the
SparsePCA objective is positive or negative.

Proof. Let $\mathrm{VAL}_A(k) = \max_{\|x\|_2 = 1, \|x\|_0 \le k} x^T A x$. It is well known that solving SparsePCA
exactly is NP-hard even in the PSD case; i.e. it is NP-hard to distinguish between $\mathrm{VAL}_A(k) \ge c$
and $\mathrm{VAL}_A(k) \le s$ for some (potentially very close) $s < c$. Consider the modified matrix
$A' = A - \big(\frac{c + s}{2}\big) I$. Since $x^T A' x = x^T A x - \frac{c + s}{2}$ for every unit vector $x$, we conclude that it
is NP-hard to distinguish $\mathrm{VAL}_{A'}(k) \ge \frac{c - s}{2} > 0$ from $\mathrm{VAL}_{A'}(k) \le \frac{s - c}{2} < 0$. □
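The effect of the diagonal shift can be checked by brute force on a small instance. This is only a sketch: the thresholds `c` and `s` below are hypothetical numbers chosen around the computed optimum, and `val` is a hypothetical helper that enumerates all supports.

```python
import itertools
import numpy as np

def val(A, k):
    """Brute-force VAL_A(k): largest top eigenvalue over all k x k principal submatrices."""
    n = A.shape[0]
    return max(np.linalg.eigvalsh(A[np.ix_(S, S)])[-1]
               for S in itertools.combinations(range(n), k))

rng = np.random.default_rng(0)
n, k = 6, 2
B = rng.standard_normal((n, n))
A = B.T @ B                         # a PSD instance

v = val(A, k)
c, s = v, v - 0.2                   # hypothetical thresholds with s < c and VAL_A(k) >= c
shift = (c + s) / 2
A_shift = A - shift * np.eye(n)     # no longer PSD in general

v_shift = val(A_shift, k)
print("VAL_A:", v, "VAL_A':", v_shift, "(c - s)/2:", (c - s) / 2)
```

Because every feasible $x$ has unit norm, subtracting a multiple of the identity moves the objective by the same constant for all of them, so the gap between the two cases is preserved while its sign structure changes.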